ABSTRACT

Summary: Recently developed methods that couple next-generation
sequencing with chromosome conformation capture-based techniques,
such as Hi-C and ChIA-PET, allow for characterization of
genome-wide chromatin 3D structure. Understanding the organization
of chromatin in three dimensions is a crucial next step in unraveling
global gene regulation, and methods for analyzing such data are
needed. We have developed HiBrowse, a user-friendly web tool
consisting of a range of hypothesis-based and descriptive statistics, using
realistic assumptions in null models.

Availability and implementation: HiBrowse is supported by all major 
browsers, and is freely available at http://hyperbrowser.uio.no/3d. 
Software is implemented in Python, and source code is available for 
download by following instructions on the main site. 
Contact: jonaspau@ifi.uio.no 

Supplementary Information: Supplementary data are available at 
Bioinformatics online.

1 INTRODUCTION

Methods for detection of genome-wide chromatin 3D conformation,
such as Hi-C (Lieberman-Aiden et al., 2009) and ChIA-PET
(Fullwood et al., 2009), are drastically expanding our
understanding of genome biology. However, statistical and computational
methods to analyze chromatin conformation capture-based
data are needed. Many of the available methods focus on
data visualization, or are not suited for genome-wide statistical
investigations (Bau et al., 2010; Servant et al., 2012; Thongjuea
et al., 2013; Zhou et al., 2013). The structure of chromatin makes
statistical analysis complicated, due to correlations between the
interaction frequencies caused by both sequence-dependent and
topological constraints (Paulsen et al., 2013). A few statistical
tests have been proposed, with varying possibilities to account
for structural dependencies (Botta et al., 2010; Kruse et al., 2013;
Paulsen et al., 2013; Wang et al., 2013; Witten and Noble, 2012).
Two useful command-line tools are the hiclib package (Imakaev
et al., 2012) and the HOMER software suite (Heinz et al., 2010),
which both allow for noise removal, outlier detection and compartment
identification. The HOMER software additionally
allows for identification of significant interactions in a given
dataset, assuming a binomial distribution and a background
model taking into account sequence-based and compartmental
biases.

The global nature of these data allows for other types of statistical
investigations beyond detecting significance of individual
interactions. A common type of analysis is to take a set of
genomic elements (genes, regulatory elements, transcription factors,
etc.), and ask how this subset, or 'query track', is spatially
arranged in 3D space as represented by, for example, a Hi-C dataset.
Here we present HiBrowse, a web-based analysis server
for performing statistical analysis of 3D genomes in a range of
different settings. The available statistics provide a flexible and
expandable catalog of tools based on state-of-the-art statistical
methods utilizing Monte Carlo (MC) and analytic methods as
suited, in addition to a range of tools for visualization and
hypothesis-generating investigations.

2 FEATURES AND METHODS

2.1 Data representation and analysis framework 

We build on general software components of the Genomic HyperBrowser 
(Sandve et al., 2010, 2013), a web-based analysis server for genome-scale
data. The graphical user interface (GUI) is based on Galaxy (Goecks
et al., 2010), a user-friendly point-and-click environment familiar to
many researchers. All tracks are based on a representation of elements
as mathematical objects, consisting of points, segments, functions and
variants of these [see Gundersen et al. (2011) for an in-depth discussion].
Any given analysis can be performed on all chromosomes, specific 
chromosomes or selected sub-parts of chromosomes, depending on the 
needs. 

In practice, an analysis is initiated by selecting one or more tracks either 
from the HyperBrowser repository, or from the user history. At least one 
of the selected tracks must be a Hi-C (3D) track, and the accompanying 
selected tracks (called 'query tracks') determine the types of statistical 
analyses that are possible, and therefore selectable in the system. 

A range of publicly available 3D-datasets have been installed in the 
repository. Since it has been shown that Hi-C and similar data can contain
systematic biases, all the available Hi-C datasets have been corrected
for such biases using the method of Imakaev et al. (2012). Furthermore, a
specialized tool has been developed to allow users to upload their own
Hi-C data (or similar) into the history, even if the dataset itself does not
conform to well-known formats. See Supplementary Table S1 for a list of
already installed and pre-processed Hi-C datasets.

2.2 Overview of statistical methods 

Statistical tools are divided into two broad categories: hypothesis tests 
and descriptive statistics. Hypothesis tests are both MC based and analytical.
Due to the complex structure of chromatin conformation capture
data, finding suitable explicit null distributions is generally not possible
(Paulsen et al., 2013; Witten and Noble, 2012), and even randomization
of the data through MC is difficult. Therefore, we consistently perform
permutations on the query track only. The hypothesis tests can be divided
into three types, defined by the query track type, as illustrated in
Figure 1A. For example, Points (P) are used to analyze general
(all-versus-all) 3D co-localization by specifying a set of genomic elements
using the BED format, while Linked Points (LP) are used to analyze
3D co-localization between selected pairs of elements by providing
additional information about which genomic elements should be linked
together.

In the most basic case, if the user selects a set of points (genomic
elements) in BED format in addition to a Hi-C data track, one may
ask whether all the genomic elements in the BED file are more/less
co-localized in 3D, in an all-versus-all fashion, than would be expected
by chance. In this case, the mean of the observed standardized interaction
frequencies is compared to the expected value estimated from the
permuted positions in representative regions of the rest of the Hi-C (3D)
track. This analysis was introduced in Paulsen et al. (2013), and in this
article we expand the methodologies by allowing a much wider variety of
query tracks. For example, by specifying two point-tracks (two BED
files), in addition to a Hi-C (or similar) track, the user can ask whether
the points in track 1 are more/less co-localized with track 2 than expected
by chance. In this type of statistical question, the permutations can be
performed on both of the point-tracks, or on one of them while preserving
the other completely.
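
The all-versus-all test described above can be illustrated with a minimal
Python sketch. It is not the HiBrowse implementation: the function name
colocalization_test, the dense NumPy matrix of interaction frequencies, and
the uniform sampling of random bins are all assumptions made for
illustration. In particular, HiBrowse permutes positions within
representative regions of the track, which this sketch does not attempt to
reproduce.

```python
import numpy as np

def colocalization_test(hic, query_bins, n_perm=1000, seed=0):
    """MC test (illustrative): are the query bins more co-localized
    than equally sized random sets of bins?"""
    rng = np.random.default_rng(seed)
    query_bins = np.asarray(query_bins)
    k = len(query_bins)
    iu = np.triu_indices(k, k=1)  # all unordered pairs among k elements

    def mean_pairwise(bins):
        # Mean interaction frequency over all pairs of the given bins
        return hic[np.ix_(bins, bins)][iu].mean()

    observed = mean_pairwise(query_bins)
    # Null distribution: permute the query track only (assumed here to
    # mean drawing k distinct bins uniformly at random)
    null = np.array([
        mean_pairwise(rng.choice(hic.shape[0], size=k, replace=False))
        for _ in range(n_perm)
    ])
    # One-sided MC p-value with the usual add-one correction
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    enrichment = observed / null.mean()  # observed vs. expected
    return observed, p_value, enrichment

# Toy data: 50 bins, with bins 0-4 forming a strongly interacting cluster
rng = np.random.default_rng(1)
toy = rng.random((50, 50))
toy = (toy + toy.T) / 2
toy[:5, :5] += 5.0
print(colocalization_test(toy, [0, 1, 2, 3, 4], n_perm=200))
```

The ratio of the observed mean to the mean of the permuted values plays the
role of an enrichment score in this sketch; the exact definition used by
HiBrowse is given in the Supplementary Material.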

It is also possible to specify particular interactions between a set of
genomic elements, and compare these interactions with randomly permuted
interactions within the same set of elements. In HiBrowse, interactions
between genomic elements are defined using LP, a format
described in detail elsewhere (Gundersen et al., 2011). Such linked track
types can easily be created by using a dedicated tool that converts from a
simple BED file format containing information about which elements
should be linked together (see Supplementary Fig. S1 for an example).
Since this type of analysis only permutes interactions internally
within the query track, the positions of all elements will be
completely preserved. This type of analysis should be used whenever
specific interactions between genomic elements are considered, and it
would be natural to compare with random links between the same elements.
Since regions of the genome can have varying properties (active/
inactive genes, open/closed chromatin, etc.), global shuffling of links
between all selected elements is not always preferable. To take such
properties into account during the permutation, each of the points can be
marked by a value, such that the link-permutations will be performed
by preserving the value-combinations on both sides of the links.
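
One way to realize a link permutation that preserves value-combinations is
sketched below. The names permute_links and linked_colocalization_test are
hypothetical, and shuffling only the right-hand endpoints within groups of
equal value is an assumed scheme chosen for simplicity; the Supplementary
Material defines the procedure actually used.

```python
import numpy as np

def permute_links(links, labels, rng):
    """Shuffle right-hand endpoints among links whose right endpoint
    carries the same value, so each link keeps its value-combination."""
    links = np.asarray(links)
    left, right = links[:, 0], links[:, 1].copy()
    right_labels = labels[right]
    for value in np.unique(right_labels):
        idx = np.flatnonzero(right_labels == value)
        right[idx] = rng.permutation(right[idx])
    # Note: if an element occurs on both sides, self-links may arise;
    # this sketch does not filter them out.
    return np.column_stack([left, right])

def linked_colocalization_test(hic, links, labels, n_perm=1000, seed=0):
    """MC test (illustrative): do the given links have a higher mean
    interaction frequency than random links between the same elements?"""
    rng = np.random.default_rng(seed)
    links = np.asarray(links)
    labels = np.asarray(labels)

    def mean_freq(l):
        return hic[l[:, 0], l[:, 1]].mean()

    observed = mean_freq(links)
    null = np.array([mean_freq(permute_links(links, labels, rng))
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value

# Toy data: eight links (i, i + 10) with elevated interaction frequency
toy = np.full((20, 20), 0.1)
toy_links = np.array([[i, i + 10] for i in range(8)])
toy[toy_links[:, 0], toy_links[:, 1]] = 5.0
toy_labels = np.array(["open"] * 20)  # one value per bin
print(linked_colocalization_test(toy, toy_links, toy_labels, n_perm=200))
```

Because only the pairing of elements changes, every element keeps its
genomic position in each permutation, as described above.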

If the user wants full control over exactly which pairs of interactions
are allowed to take part in the link-permutations, it is possible to
specify a case/control value on each of the links via a dedicated tool
which accepts two BED files ('case' and 'control') of the same format
as described above (see Supplementary Fig. S2 for an example). The
case/control-linked elements can then be selected together with a Hi-C
(3D) track, allowing the user to compare the interaction frequency of all
the links marked as 'case' with the expected interaction frequency given
by permuting the case/control labels. This type of statistic is best suited
to data that are sampled only from a pre-defined set of elements of the
genome, and where the user wants to find out whether a subset of
these elements is co-localized in 3D.
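
The case/control label permutation can be sketched along the same lines.
Again, this is only an illustration under assumptions: case_control_test is
a hypothetical name, links are given as pairs of bin indices, and the test
statistic is taken to be the plain mean interaction frequency of the 'case'
links.

```python
import numpy as np

def case_control_test(hic, links, is_case, n_perm=1000, seed=0):
    """MC test (illustrative): do 'case' links have a higher mean
    interaction frequency than expected when labels are permuted?"""
    rng = np.random.default_rng(seed)
    links = np.asarray(links)
    is_case = np.asarray(is_case, dtype=bool)
    # Interaction frequency of every link; the links stay fixed
    freqs = hic[links[:, 0], links[:, 1]]
    observed = freqs[is_case].mean()
    # Null distribution: shuffle the case/control labels across links
    null = np.array([freqs[rng.permutation(is_case)].mean()
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value

# Toy data: 20 links, of which the first five are 'case' links
toy = np.full((40, 40), 0.1)
toy_links = np.array([[i, i + 20] for i in range(20)])
toy_case = np.arange(20) < 5
toy[toy_links[:5, 0], toy_links[:5, 1]] = 5.0
print(case_control_test(toy, toy_links, toy_case, n_perm=200))
```

Only the labels are shuffled, so the null distribution is built entirely
from the links the user supplied, which matches the situation where data
are sampled from a pre-defined set of elements.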

Finally, it is possible to find statistically significant differences
between two Hi-C datasets, for example comparing treatments [as in, e.g.,
Rickman et al. (2012)]. The statistical test implemented for this type of
analysis is based on the edgeR tool (Robinson et al., 2010). Details about
the mathematical formulation of the different types of statistics and their
corresponding null hypotheses are found in the Supplementary Material.

Fig. 1. (A) Overview of statistical hypothesis tests implemented in HiBrowse. See Gundersen et al. (2011) for an in-depth explanation of track types, and
the Supplementary Material for details about each statistic. (B) Example of a HiBrowse analysis using the 'Linked elements more/less co-localized in
3D?' statistic, investigating whether fusion transcripts are co-localized in 3D. (C) Result page from the analysis, presenting the question asked by the user
together with both a simplistic and a more detailed answer giving the P-value and model assumption details. Links are provided to full details of the
results at individual chromosome regions.

In addition to hypothesis tests, a range of descriptive statistics has
been implemented. For example, each hypothesis test is accompanied by
an enrichment score, giving the degree of over/under-representation of
3D co-localization, compared to the expected 3D co-localization (see
Supplementary Material for details). Other types of available descriptive
statistics are visualization of clustered Hi-C matrices as heatmaps or
graphs, principal component analysis on Hi-C matrices and other summary
statistics (see Supplementary Table S2 for a comprehensive list). All
available analyses are described thoroughly on the help pages linked from
the main site, where example histories are provided such that users can
explore each statistic in detail. Demo buttons are provided for all tools,
giving small example runs. See Figure 1B and C for an analysis example.